ABSTRACT


In previous papers we have reported the N-terminal 40 amino acids of the small subunit of rubisco for samples 
from four families of gymnosperms, nine families of monocotyledons, and 26 families of dicotyledons. We expanded 
this list to 122 families of dicots and derived a phylogenetic tree for all 335 species. The main computing program 
used was HENNIG86, with which a reliable result can be assured with only 17 taxa or less, so a major part of this 
paper is concerned with the strategy adopted to divide the 335 species and then to build the parts into an overall 
tree that is as accurate and objective as possible. Comparison with other taxonomy suggests that, at the level of 
placing genera into families, our methods give results that are at least 90% accurate. At higher taxonomic levels 
accuracy may decrease, and the result should be regarded not as a firm conclusion but as a working hypothesis for 
subsequent testing using the longer sequences from nucleic acids. Topics discussed include heterogeneity within 
species, the nature of the N-terminus of rubisco-SSU, and evidence that natural selection is powerful in determining 
amino acid sequence. The rate of evolution has been shown to vary between major taxa, and data suggest that 
angiosperms originated in the Jurassic. 


The problems of angiosperm phylogeny are well 
illustrated by a consideration of the differences 
between four classifications, all less than a decade 
old and all by highly respected and experienced 
authors. The dicotyledons are divided into six sub- 
classes by Cronquist (1981) and seven by Takh- 
tajan (1983), while, for the other two authors, the 
major groupings are superorders, Thorne (1983) 
having 19 and Dahlgren (1983) 25. The number 
of dicotyledonous orders recognized is, respective- 
ly, 58, 72, 41, and 83; these figures alone indicate 
the resulting diversities of names and content, all 
of which reflect our comparative ignorance of the 
course that evolution has taken in the angiosperms. 
In contrast to this, at the next level down the 
hierarchy, there is basic agreement about the "core" 
families to be recognized (Heywood, 1978). 

Macromolecular sequences provide taxonomic 
characters whose homology over widely diverse 
species can be assumed with some confidence. Se- 
quence data can be analyzed objectively with com- 
puters. We will probably see in the next decade 
the publication of nucleic acid sequences long and 
variable enough to solve some of the problems of 
angiosperm phylogeny (e.g., Palmer et al., 1988; 
Zimmer et al., 1989). It is therefore an appropriate 
time, when nucleic acid sequencing is supplanting 
protein sequencing, to set out the results of a de- 
cade of work that has produced 335 partial protein 
sequences from a wide range of angiosperms. These 
sequences are shorter than nucleic acid sequences 
already published and therefore contain less infor- 
mation and are less able to resolve the sequential 
divergences of early radiations. Nevertheless, we 
believe that our phylogenetic trees will indicate 
likely relationships and profitable working 
hypotheses for future investigations. 


A SUMMARY OF PUBLISHED INVESTIGATIONS 
USING PROTEIN SEQUENCES 


The pioneer of the use in botany of protein 
sequences for investigating plant phylogeny was D. 
Boulter of the University of Durham, England. 
During the 1970s, Boulter, along with his col- 
leagues and students, published 25 sequences of 
cytochrome c, 12 complete and 58 partial se- 
quences of plastocyanin and seven sequences of 
ferredoxin. These have been collated, with 
references, by Ramshaw (1982), and Scogin (1981) 
has reviewed the results from the taxonomic point 
of view. Although this work generated much in- 
terest, it also gave rise to skepticism, some of which 
can, with hindsight, be attributed to the inadequa- 
cies of computing methods that were being devel- 
oped concurrently. The mostly unfavorable reac- 
tion of systematists, epitomized by the review of 
Cronquist (1976), influenced the cessation of re- 
search in Boulter’s laboratory about 1980. 

Before this, however, partial sequences (up to 
25 N-terminal amino acids) of the small subunit of 
ribulose-1,5-bisphosphate carboxylase/oxygenase 
(rubisco-SSU) were obtained from six species (Has- 
lett et al., 1976; Strobaek et al., 1976). This work 
led to a complete SSU sequence from spinach (Mar- 
tin, 1979), a forerunner of the work presented 
here which concerns the N-terminal 40 amino acids 
of this protein. (The complete sequencing of a 
protein requires prior purification of several frag- 
ments and is at least an order of magnitude more 
time-consuming than the direct sequencing of the 
N-terminus of the whole protein using an automatic 
sequencer.) Nucleotide sequences of rubisco-SSU 
from a few species have been published, and all of 
them have been studied using our method. The 
only new data comparable to our 335 species are 
from two closely related orchids and their hybrid 
(G. C. Martin et al., 1987). We are unaware of 
phylogenetically useful sequences of other proteins 
since those of Grund et al. (1981) and Nakano et 
al. (1981). 

Work in our laboratory has proceeded in five 
phases. In phase 1 species were chosen because 
Boulter had already published their complete se- 
quences of cytochrome c and partial sequences of 
plastocyanin. When a pattern failed to emerge from 
analyses of these data, we decided to sample each 
family with sequences from at least two more rep- 
resentative genera. Thus, the families Apiaceae, 
Asteraceae, Brassicaceae, Caprifoliaceae, Cheno- 
podiaceae, Fabaceae, Malvaceae, Poaceae, Polyg- 
onaceae, Ranunculaceae, and Solanaceae have each 
been sampled at least three times. These early 
results were published in a series of papers (Martin 
et al., 1983; Martin & Dowd, 1984a, b, c). 

The sequences for rubisco-SSU, cytochrome c, 
and plastocyanin were analyzed for these families 
by Martin, Boulter, and Penny (1985) using de- 
rived estimates of familial node sequences. Anal- 
yses of data from single macromolecules were not 
consistent with one another but, for nine of the 
families, a phylogenetic tree derived from combined 
data remained consistent when ferredoxin or 5S 
ribosomal RNA (available for some of the families) 
was added. 

This result indicated the need for longer se- 
quences and better sampling of families. Although 
rubisco-SSU was always multiply represented, in 
17 of the 33 samples of other macromolecules 
there was only a single sequence. This situation is 
precarious because, if the average distance from 
a familial node to a species is N, then on the average 
a single sequence will misrepresent the familial node 
by N. This source of error might be responsible 
for part of the poor agreement observed. Sampling 
a family at least twice, preferably from widely 
divergent representatives, should give a better 
estimate of the familial node (see phase 5). 

In phase 2 we sequenced rubisco-SSU from 11 
members of Onagraceae (Martin & Dowd, 1986a), 
15 monocotyledons (Martin & Dowd, 1986b), and 
14 species of Solanum (Martin et al., 1986). We 
reasoned that the reliability of our methods might 
be estimated by comparison with taxonomically well 
understood groups. The results were similar to oth- 
er taxonomic treatments. Additional species of As- 
teraceae were also studied and those results will be 
presented in this paper. 

To estimate the rate of evolution, Proteaceae, 
Solanaceae, Fagaceae, and Winteraceae were sam- 
pled in phase 3 using species whose ancestors are 
thought to have been separated by continental drift 
at known times. This led to a preliminary publi- 
cation (Martin & Dowd, 1984b), and the derivation 
of a molecular evolutionary clock (Martin & Dowd, 
1988), which indicated that on average one nu- 
cleotide difference arose between two diverging 
lines once in seven million years. 
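
As a rough worked illustration of how such a clock turns sequence differences into time, the Python sketch below multiplies a pairwise count of inferred nucleotide differences by the seven-million-year figure quoted above. The function name and the example count are hypothetical, and the sketch ignores multiple substitutions at one site and any variation in rate between taxa.

```python
# Clock rate taken from the text: on average one nucleotide difference
# between two diverging lines every seven million years.
MILLION_YEARS_PER_DIFFERENCE = 7.0


def divergence_time_my(pairwise_differences):
    """Approximate time since divergence, in millions of years.

    pairwise_differences: inferred nucleotide differences between two
    sequences (hypothetical input for this illustration).
    """
    if pairwise_differences < 0:
        raise ValueError("differences cannot be negative")
    return pairwise_differences * MILLION_YEARS_PER_DIFFERENCE


# Hypothetical example: two sequences separated by 10 differences.
print(divergence_time_my(10))  # 70.0
```

An estimate of this kind is only as good as the assumption of a constant rate, so it is best read as an order-of-magnitude guide.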

In phase 4 we tested the hypothesis that leghe- 
moglobin had evolved in plants by lateral transfer 
from animals. This led to an investigation of all 
species for which leghemoglobin sequences had been 
published, and it was shown that the pathway of 
evolution in those species was closely parallel in 
hemoglobin and rubisco-SSU (Martin & Dowd, 
1986c), suggesting that there was no need to in- 
voke novel evolutionary processes. A consequence 
of this study was that we increased the number of 
species of Fabaceae sequenced to eight (see Group 
14 below) and obtained sequences from several 
additional families. Many of these were too small 
to be studied in the normal course of this investi- 
gation but were obtained either because they are 
known to include nitrogen-fixers or thought to be 
relatives of the legumes; these include Betulaceae, 
Casuarinaceae, Chrysobalanaceae, Coriariaceae, 
Crossosomataceae, Datiscaceae, Elaeagnaceae, 
Moringaceae, and Myricaceae. 

In phase 5 we surveyed the dicotyledons, which 
increased the number of families studied from 24 
to 124. 


A SURVEY OF THE DICOTYLEDONS 


There are about 250 families of dicots. Because 
it was impractical to sample all of them, a decision 
was made to sample about half, i.e., to increase 
the number from the 24 mentioned above to 124. 
Three families (Acanthaceae, Loranthaceae, San- 
talaceae) failed for reasons that will be discussed 
later. The additional 97 families were chosen pri- 
marily on the basis of size. The majority of families 
sampled have more than 20 genera. To cover as 
wide a range of variation as possible, some small 
families were also sampled. For example, the order 
Illiciales has only three genera, so the family Schi- 
sandraceae (two genera) was chosen to represent 
it. Only three orders are unrepresented out of 
Thorne’s 41 (two of which are parasitic and devoid 
of rubisco), 10 out of Cronquist’s 58, 19 out of 
Takhtajan’s 72, and 21 out of Dahlgren’s 83. 

It is impractical, mainly because computers are 
limited in their capacities to analyze large numbers 
of taxa simultaneously, to contemplate building a 
phylogenetic tree for 122 families (comprising 310 
species) without some subdivision into groups. We 
have done this by referring to all four current 
phylogenies. Thorne (1983) and Dahlgren (1983) 
have superorders as their major groups, the former 
nominating 19 and the latter 25. If these two 
authors agree that families are in the same super- 
order then they have been grouped together in our 
scheme, with one proviso. Takhtajan (1983) and 
Cronquist (1981) have respectively seven and six 
subclasses as their major groups, and these two 
authors have been allowed a veto; if either of them 
does not also agree that families are in the same 
subclass, then they are left ungrouped. In this 
way we have divided 102 of the studied families 
into 25 Groups, leaving 20 ungrouped because 
there is disagreement. We are reluctant to use a 
formal term like superorder but need to make it 
clear that our use of Group does have a defined 
meaning, so we have used a capital G. The Groups 
are shown in Table 1. 
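
One way to read the grouping rule just described is as a sketch in Python: families fall into the same Group only when all four authors place them in the same major taxon, and any family left alone is ungrouped. The function name, the family names, and the taxon labels below are hypothetical placeholders, not real classifications, and the Groups in Table 1 may well have been assembled by hand rather than by any such procedure.

```python
from collections import defaultdict


def build_groups(placements):
    """Group families that share the same major taxon under all four authors.

    placements maps a family name to a 4-tuple:
    (Thorne superorder, Dahlgren superorder,
     Cronquist subclass, Takhtajan subclass).
    Returns (groups, ungrouped), both sorted for stable output.
    """
    buckets = defaultdict(list)
    for family, taxa in placements.items():
        buckets[tuple(taxa)].append(family)
    groups = sorted(sorted(members) for members in buckets.values()
                    if len(members) >= 2)
    ungrouped = sorted(members[0] for members in buckets.values()
                       if len(members) == 1)
    return groups, ungrouped


# Hypothetical toy data: FamC is vetoed because the fourth author
# places it in a different subclass from FamA and FamB.
toy = {
    "FamA": ("T1", "D1", "C1", "K1"),
    "FamB": ("T1", "D1", "C1", "K1"),
    "FamC": ("T1", "D1", "C1", "K2"),
    "FamD": ("T2", "D3", "C2", "K2"),
}
groups, ungrouped = build_groups(toy)
print(groups)     # [['FamA', 'FamB']]
print(ungrouped)  # ['FamC', 'FamD']
```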

It was practicable to sample each new family 
only twice, and we have done this by choosing two 
species not only from different genera but, if pos- 
sible, from different subfamilies or tribes. Some- 
times this criterion has broken down because fresh 
leaves have not been available. 

In Table 2 the 335 species for which sequences 
are available are arranged by families and Groups, 
and their sources and sequences are given. 


BIOCHEMICAL METHODS 


The methods published by Martin and Jennings 
(1983) have stood the test of time, so, rather than 
repeat them here, a general description will be 
given and the few modifications mentioned. 

Two methods were described, one for “pungent” 
leaves with high concentrations of phenolics or 
other substances that make protein purification 
difficult, the other for “bland” species whose leaves 
are much more amenable. The bland method gives 
better quality protein and is therefore to be pre- 
ferred. However, because the pungent method works 
well with bland leaves, but not vice versa, it was 
preferred when there was doubt or too few leaves 
for trial extractions. 

Both procedures started with maceration of about 
100 g of leaves from which the midribs were removed 
if practicable. For bland leaves the extracting 
buffer was essentially a reducing, saline Tris-HCl 
buffer at pH 7.4, while for pungent leaves a 
reducing, saline borate buffer at pH 8.6 and con- 
taining the detergent Triton X-100 was used. After 
crude straining and centrifugation to remove solids, 
the extract was passed through a succession of two 
liquid gel columns. A Sephadex G-25 column was 
used first to remove low molecular weight sub- 
stances. A Sepharose 6B column was used to re- 
move remaining low molecular weight substances 
and high molecular weight nucleic acids and mem- 
brane fragments. Eluting buffers were different for 
the two extraction procedures and for the different 
columns used. The protein was precipitated with 
ammonium sulfate for the bland method and with 
acetone for pungent. Procedures after the second 
column were the same for both types of leaves. 
The protein was S-carboxymethylated at pH 8.6 
to break disulphide bridges between cysteine res- 
idues and then passed through a long column of 
Sephadex G-100 in an eluting buffer containing 
sodium dodecyl sulfate. This separated the large 
subunit from the small subunit, which was precip- 
itated in acetone and dried before sequencing. (A 
variation of this procedure was to use a column of 
G-75 followed by G-100.) 

The methods are rather crude but are successful 
because rubisco is a very large protein and, by a 
considerable margin, the most abundant protein in 
leaves. 

About 5 mg of small subunit (in 0.5 ml of water 
without polybrene) was sequenced on the Beckman 

TABLE 1. Families of dicotyledons grouped because they are placed in the same major taxon by all of Cronquist 
(1981), Dahlgren (1983), Takhtajan (1983), and Thorne (1983). 

GROUP 1: Magnoli, Winter, Annon, Myristic, Schisandr, Monimi, Laur, Aristoloch, Calycanth 
GROUP 2: Berberid, Ranuncul, Lardizabal, Menisperm, Papaver 
GROUP 3: Cabomb, Nymphae 
GROUP 4: Ulm, Mor, Urtic 
GROUP 5: Hamamelid, Betul, Fag, Casuarin 
GROUP 6: Dilleni, Thea, Ochn, Clusi 
GROUP 7: Myric, Jugland 
GROUP 8: Caryophyll, Nyctagin, Amaranth, Phytolacc, Chenopodi 
GROUP 9: Dipterocarp, Elaeocarp, Tili, Sterculi, Bombac, Malv 
GROUP 10: Viol, Flacourti, Datisc, Cucurbit, Salic, Cappar, Brassic, Resed, Moring 
GROUP 11: Sapot, Styrac, Primul, Myrsin 

FAMILIES THAT DO NOT FIT INTO ONE OF THE GROUPS 
Aster, Bux, Campanul, Chrysobalan, Coriari, Crossosomat, Elaeagn, Euphorbi, Goodeni, Hydrophyll, Lecythid, Loas 

Note: “-aceae” omitted from all names. 


890C automatic sequencer using Beckman's stan- 
dard quadrol program with 50% quadrol buffer. 
The phenylthiohydantoin (PTH) derivatives of the 
amino acids were identified using a Waters HPLC 
instrument with a C-18 radially compressed column 
and eluted with 0.1 M sodium acetate (pH 6.0) and 
acetonitrile. This did not distinguish two pairs of 
amino acids and was therefore supplemented with 
TLC. 

Using these methods, we could, without assis- 
tance, produce two proteins each week and se- 
quence two others. 


FAILURES 


Although 90% of attempts led to successful se- 
quences, the remaining 10% deserve brief atten- 
tion. Unless there was an identified reason for fail- 
ure that could be corrected, our policy was to try 
another representative of the family. 

Faults that could be corrected include the 
amounts of extraction and elution buffers used. 
Some plants gave extracts that were mucilaginous 
to the point of setting solid. Dilution of the extract 


TABLE 1 (continued). 

GROUP 12: Eric, Epacrid 
GROUP 13: Cunoni, Ros, Saxifrag 
GROUP 14: Caesalpini, Mimos, Papilioni 
GROUP 15: Trap, Lyth, Myrt, Punic, Onagr, Melastomat, Combret 
GROUP 16: Olac, Celastr 
GROUP 17: Connar, Sapind, Anacardi, Simaroub, Meli, Rut 
GROUP 18: Halorag, Rhizophor 
GROUP 19: Zygophyll, Gerani, Tropaeoli, Malpighi 
GROUP 20: Logani, Gentian, Apocyn, Asclepiad, Ole, Rubi 
GROUP 21: Lami, Verben 
GROUP 22: Solan, Convolvul, Polemoni 
GROUP 23: Scrophulari, Gesneri, Bignoni, Pedali 
GROUP 24: Valerian, Caprifol 
GROUP 25: Api, Arali 

FAMILIES THAT DO NOT FIT INTO ONE OF THE GROUPS (continued) 
Nelumbon, Piper, Plumbagin, Polygon, Prote, Rhamn, Thymelae, Vit 


corrected this. This problem occurred in Onagra- 
ceae and a few others with small leaves containing 
a high proportion of veins. Insolubility of the pro- 
tein, leading to precipitation in columns, could 
sometimes be corrected by loading a more dilute 
extract. Plants with C4 photosynthesis, and rubisco 
tightly bound in bundle sheaths, were avoided if 
possible. Plants with C3 photosynthesis often occur 
in the same genera or families and were unlikely 
to be phylogenetically biased. However, if unavoid- 
able (e.g., Welwitschia is reported to be C4), spe- 
cial care was taken during the maceration process. 

It is suspected that the most common cause of 
failure was the presence of powerful proteases in 
the leaves and, in retrospect, it would have been 
profitable to try correcting this with research early 
in the project. Species of Ficus, known to have 
leaf proteases, showed symptoms of this failure. 
Large amounts of protein traveled where the small 
subunit should have been on the G-100 column 
and gave many amino acids at each position when 
sequenced. Another casualty of this sort was Gne- 
tum gnemon, which was particularly desired be- 
cause it is a gymnosperm thought to be close to 
angiosperm ancestors. All three members of 
Acanthaceae that were tried failed with symptoms like 
these, as did four out of six species from Caesal- 
piniaceae. 

Finally, an entirely different sort of failure oc- 
curred with four species, all hemiparasites from 
the putatively related families Santalaceae and Lo- 
ranthaceae. These species had abnormally high 
amounts of phenolics, but it seems unlikely that 
failure can be attributed to them or to any of the 
other causes mentioned above. The preparations 
always yielded abnormally high amounts of plas- 
tocyanin but no trace of rubisco-SSU. Plastocy- 
anin, a chloroplast protein, has a molecular weight 
sufficiently close to rubisco-SSU that it occurs, 
occasionally, as a small contaminant detected dur- 
ing sequencing. It could be identified by its se- 
quence but, except in these two families, it was so 
weak that it disappeared after about seven posi- 
tions. The strength of the plastocyanin sequence 
in all four of these hemiparasites suggests that the 
absence of rubisco-SSU could not be ascribed to 
some general difficulty like proteases, but might 
reflect an unusual, perhaps facultative, photosyn- 
thetic system. 


GENERAL REMARKS ABOUT THE SEQUENCES 
OVERALL VARIATION AND INVARIANT SITES 


A summary of the variation that we have ob- 
served is given in Table 3; the amino acids most 
commonly observed are in the top line. 

The rubisco-SSU gene includes two introns, the 
first of which is inserted before the codon that 
determines amino acid 3. It determines valine and 
this, like tryptophan at position 4, is invariant, the 
two codons carrying the signal to cut the end of 
the intron (Berry-Lowe et al., 1982). These in- 
variant residues were useful early signals that the 
correct protein fraction had been chosen. Within 
the first 40 amino acids, proline always occurs at 
position 5 and/or 6, at position 19 and/or 20, 
and at position 40. These three regions correspond 
to bends in the tertiary structure of the molecule. 
Chapman et al. (1988) have indicated that between 
the first and second bend there is alpha-helix and 
thereafter beta-sheet. There is an almost invariant 
region from amino acids 13 to 18, a region that 
makes contact with one of the large subunits (Chap- 
man et al., 1988; Knight et al., 1989). The only 
variation we have found in this region is the sub- 
stitution of phenylalanine for leucine at position 15 
in five species of Solanum (Martin et al., 1986). 
These same species also have phenylalanine sub- 
stituted for leucine at the almost invariant position 
21. The simultaneous occurrence of two very rare 
substitutions suggests a causal connection. 
Hydrophobic bonding between the two positions may 
stabilize the bend at position 19 and, because these 
species are inhabitants of very hot and arid regions, 
this may have been a factor in natural selection. 


HETEROGENEITY WITHIN SPECIES 


The first reports of rubisco-SSU sequences (Stro- 
baek et al., 1976) were for the N-terminal amino 
acids in species of Nicotiana, and these showed 
heterogeneity at positions 7 and 8 in tobacco. We 
have also found it at position 30. These hetero- 
geneities are undoubtedly associated with the am- 
phidiploid origin of tobacco and led to the expec- 
tation that heterogeneity would be fairly common, 
not only because about one-third of plant species 
are polyploid, but also because in diploids gene 
duplications are frequent. We may not have de- 
tected some heterogeneities (for example, those 
involving serine, which gives a weak signal), but 
we did detect 34 species with one heterogeneity, 
11 with two, and 4 with three. The demonstration 
by Pichersky et al. (1986) that in tomato there 
were at least three different DNA messages for 
rubisco-SSU, all with the same N-terminal amino 
acid sequence, suggests that selection acts strongly 
to preserve primary amino acid structure. There 
are at least eight different genes encoding rubisco- 
SSU in petunia (Lamb & Fitzmaurice, 1986); for 
this reason, when we prepared protein from that 
species, we used a mixture of equal quantities of 
leaves from four morphologically different varie- 
ties, with the aim of finding heterogeneities (Martin 
& Dowd, 1984b). The sequence was of high quality 
but no heterogeneity was detected. Likewise, we 
chose to study Rhoeo discolor because it is a 
complex interchange heterozygote for all chro- 
mosomes and might therefore be heterozygous for 
rubisco-SSU, but we detected no heterogeneity. 
Heterogeneities that were found presented no prob- 
lem for the computer analysis. 


INSERTIONS 


Only two examples of additional amino acids in 
the N-terminal sequence have been found. Both 
species of Epacridaceae that we studied had an 
additional isoleucine between normal positions 9 
and 10. Teucrium flavum (Lamiaceae) had two 
additional glycines, probably between the same two 
positions. These insertions, while clearly of taxo- 
nomic significance, have been ignored during data 
processing. 


THE N-TERMINUS 


Haslett et al. (1976) reported that the N-terminus 
of rubisco-SSU was “frayed,” some molecules 
seeming to have methionine in position 1 
while others are without it. This is the situation 
that we have encountered in the vast majority of 
species, the effect being that at every position two 
amino acids are recorded, the correct one and the 
next one. Usually the two signals are approximately 
equal, especially when the protein is of highest 
quality. This property is helpful in that it provides 
a second opportunity for identification and is useful 
for identifying minor contaminating proteins whose 
residues appear only once, but probably means that 
it is more difficult to obtain long sequences because 
attenuation of the signal occurs earlier. 

All those species for which nucleic acid sequenc- 
es have been reported were also studied by us and 
all show fraying. Because the nucleic acid sequences 
show the N-terminus to be methionine, there is 
no doubt about its identity. The signals we obtained were not typical 
for either methionine or its sulfone derivative. 
Whether the derived amino acid is obtained by 
dansylation (in manual sequencing) and identified 
by TLC, or is the PTH derivative from automatic 
sequencing and identified by HPLC or TLC, the 
N-terminal amino acid moves differently from me- 
thionine; therefore, we conclude that it is a modified 
form of that amino acid. 

Two exceptions to the above generalization have 
been encountered. In 10 out of 11 species of the 
Onagraceae the N-terminus is phenylalanine, the 
only variations from methionine known, and in 
these there is no sign of fraying. In six other species 
the N-terminal amino acid is methionine (and gives 
the normal signal for PTH-methionine), but there 
is no sign of fraying, the difference from the ma- 
jority of species being sharp and unmistakable; 
these are two members of Papaveraceae (Papaver 
orientalis and Eschscholtzia californica), two from 
Pedaliaceae (Sesamum indicum and Ceratotheca 
triloba), Vitex lucens (Verbenaceae), and Mentze- 
lia lindleyi (Loasaceae). Any hypothesis to explain 
fraying must account for these exceptions, and we 
believe that they exclude artifacts arising from 
techniques of protein production or sequencing. 
Any hypothesis must also account for the modifi- 
cation of methionine and the equality of the two 
forms of the protein. We therefore dismiss as un- 
likely hypotheses relying on inefficient shortening 
of the protein either as it passes through the chlo- 
roplast membrane or after entry. 

It is known that rubisco-SSU forms dimers (Roy 
et al., 1978), and we suggest that this may be 
through formation of disulphide bridges between 
the N-terminal methionines of two SSU molecules. 
This might occur in vivo if the enzyme model of 
Chapman et al. (1988) is correct, but is more likely 
an in vitro event if the different model of Knight 
et al. (1989) is correct. If dimers are formed, 
S-carboxymethylation at pH 8.6 would not break 
an inter-methionine bond, but we suggest that the 
dimer does fall apart so that dimethionine is on 
one chain and no methionine on the other. This 
hypothesis would account for all phenomena except 
for non-fraying species, which presumably do not 
naturally form dimers. 


METHODS OF DATA ANALYSIS 


Before computer analysis, amino acid sequences 
were converted to inferred nucleotide sequences 
using the genetic code. Usually this could be carried 
out after inspection of, for example, all the se- 
quences in a Group so that the most parsimonious 
choices of codons could be made. A standard was 
chosen at sites where substitution was silent. Al- 
though a program was available (Martin et al., 
1983), usually the path was obvious and computing 
unnecessary. Thus the unit of length in phyloge- 
netic trees is an inferred nucleotide difference 
(i.n.d.). 
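
The idea of an inferred nucleotide difference can be illustrated with a minimal Python sketch that finds, for two amino acids, the smallest number of nucleotide changes separating any pair of their codons in the standard genetic code. The names `CODONS` and `min_ind` are hypothetical. The sketch compares only two residues at a time, whereas the procedure described above chooses codons parsimoniously across all the sequences in a Group, so it should be read as a simplification.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")

# Map each one-letter amino acid (or '*' for stop) to its codons.
CODONS = {}
for (b1, b2, b3), aa in zip(product(BASES, repeat=3), AMINO):
    CODONS.setdefault(aa, []).append(b1 + b2 + b3)


def min_ind(aa1, aa2):
    """Smallest number of nucleotide differences between any codon
    of aa1 and any codon of aa2 (one-letter amino acid codes)."""
    return min(sum(x != y for x, y in zip(c1, c2))
               for c1 in CODONS[aa1] for c2 in CODONS[aa2])


print(min_ind("L", "F"))  # 1: leucine to phenylalanine
print(min_ind("M", "W"))  # 2: methionine to tryptophan
print(min_ind("V", "V"))  # 0: identical residues
```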

The number of dichotomizing trees (phylogenetic 
Steiner trees) connecting N taxa is 
1 × 3 × 5 × ... × (2N-5). The principle of analysis is that 
the length of every possible tree is calculated and 
the shortest tree is chosen as the most probable. 
This agrees with the parsimonious hypothesis that 
evolution has proceeded by the shortest route. 
However, because the total number of possible trees 
increases very rapidly, i.e., when increasing from 
N-1 to N taxa it increases (2N-5)-fold, it is not 
always possible to consider every tree. 
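
The growth in the number of trees can be made concrete with a short Python sketch of the product 1 × 3 × 5 × ... × (2N-5). The function name is hypothetical; the formula is the one given in the text.

```python
def count_dichotomizing_trees(n_taxa):
    """Number of dichotomizing trees connecting n_taxa taxa:
    the product of the odd numbers from 1 up to (2N - 5)."""
    if n_taxa < 3:
        raise ValueError("the formula applies to three or more taxa")
    count = 1
    for factor in range(3, 2 * n_taxa - 4, 2):  # odd factors 3 .. 2N-5
        count *= factor
    return count


for n in (4, 5, 12):
    print(n, count_dichotomizing_trees(n))
# 4 3
# 5 15
# 12 654729075
```

Adding one taxon multiplies the count by (2N-5), which is why exhaustive evaluation quickly becomes impractical.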

Except during the final stages of this project, 
the program that we used was MINTREE, the 
"branch and bound" program of Hendy and Penny 
(1982). With a Vax 785 computer the usual limit 
for simultaneous analysis was 12 taxa. This limit 
could be extended to about 15 with a supercom- 
puter (Cyber 205), but the trouble and expense 
precluded useful work. Although MINTREE has 
now been superseded, its co-program ANALYZE 
is still used because it possesses efficient routines 
for obtaining ancestral sequences and internodal 
lengths. 

Most of the analyses have been carried out using 
HENNIG86 (Farris, 1988) and a personal com- 
puter (Microbyte 230). This system is about two 
orders of magnitude faster than the above and has 
the additional advantage that an analysis can be 
left running for days or weeks. The principles of 
its algorithms have not been published, but the 
time of its release suggests a possible connection 
with a published letter by Johnson (1987). 
HENNIG86 offers a number of options, which com- 
parison with MINTREE using the same sets of data 
suggest are reliable; in order of preference we have 
used implicit enumeration (ie*); ie followed by bb 
(branch swapping); mhennig followed by bb. Be- 
yond the number of taxa that can be handled by 
MINTREE or the ie option of HENNIG86, correct 
solutions cannot be guaranteed. 

A further advantage of HENNIG86 is that it 
includes a program for successive weighting, which 
often reduces the result to one or very few trees. 
This usually eliminates the need to derive consensus 
trees, a process that we have found unsatisfactory 
(Martin & Dowd, 1989). Finally, HENNIG86 de- 
rives the ci (consistency index) (Carpenter, 1988) 
and ri (retention index) which we record with each 
figure of a tree. In a personal communication, 
Farris defines these as follows: if r and m denote, 
respectively, the smallest and greatest number of 
steps that a character can require on any tree, and 
s denotes the number of steps that character re- 
quires on a considered tree, then c.i. is r/s and 
r.i. is (m-s)/(m-r). 
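
The two definitions can be written out as a small Python sketch for a single character, using the symbols r, m, and s exactly as defined above. The function names and the example values are hypothetical, and the sketch makes no claim about how HENNIG86 combines characters when it reports values for a whole tree.

```python
def consistency_index(r, s):
    """c.i. = r / s, where r is the smallest number of steps the
    character can require on any tree and s is the number it
    requires on the tree being considered."""
    return r / s


def retention_index(r, m, s):
    """r.i. = (m - s) / (m - r), where m is the greatest number of
    steps the character can require on any tree."""
    if m == r:
        raise ValueError("r.i. is undefined when m equals r")
    return (m - s) / (m - r)


# Hypothetical character: at best 1 step, at worst 4, observed 2.
print(consistency_index(1, 2))   # 0.5
print(retention_index(1, 4, 2))  # 0.666...
```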

MINTREE uses data such that each of the four 
nucleotides is entered as 1, 2, 4, or 8, which allows 
the counting of ordinary differences and also of 
heterogeneities; no matter what variation occurs 
at a site, it can be recorded as a sum that is always 
different for different combinations. Provided there 
are no heterogeneities, HENNIG86 can use the 
same notation (using the nonadditive option); if 
there are heterogeneities, they must be recorded 
by inserting additional taxa. This is satisfactory if 
there is only one variable site within a taxon when 
only two taxa need to be recorded. Assumptions 
of linkage must be made, however, if there is more 
than one variable site but only two taxa are to be 
recorded. This problem becomes increasingly im- 
portant as an analysis progresses from using raw 
data to derived ancestral sequences for families 
and then Groups because, in these, heterogeneities 
may be numerous. We have therefore used alter- 
native strategies. The first is inserting additional 
taxa as just described and accepting the result if 
the different versions of a taxon cluster without 
interruption. 
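
The MINTREE notation described above can be sketched in Python as follows. The text gives the values 1, 2, 4, and 8 but does not say which nucleotide receives which value, so the assignment below is an assumption made purely for illustration, as are the function names.

```python
from itertools import combinations

# Assumed assignment; the text specifies only the values 1, 2, 4, 8.
CODE = {"A": 1, "G": 2, "C": 4, "T": 8}


def encode_site(bases):
    """Sum recording every nucleotide observed at one site."""
    return sum(CODE[b] for b in set(bases))


def decode_site(value):
    """Recover the set of nucleotides from a recorded sum."""
    return {b for b, v in CODE.items() if value & v}


def all_sums():
    """Sums for all 15 non-empty combinations of the four nucleotides."""
    return [encode_site(c)
            for k in range(1, 5) for c in combinations("AGCT", k)]


print(encode_site("A"))            # 1
print(encode_site("AG"))           # 3, a heterogeneous site
print(sorted(decode_site(3)))      # ['A', 'G']
print(len(set(all_sums())) == 15)  # True: every combination is distinct
```

Because each value is a distinct power of two, no two combinations can share a sum, which is the property the notation relies on.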

The second strategy is to use binary coding for 
nucleotides; e.g., 1000 for A, 0100 for G, 1100 
for A and G, 0010 for C, and so forth. In conjunction 
with HENNIG86, MINTREE notation is 
slower than binary notation, which is therefore 
advantageous. As would be expected, binary gives 
a tree length double that using MINTREE notation 
but it is seldom exact. If inexact, the length is 
always less than double, and there is a loose re- 
lationship between the deficit and the number of 
heterogeneities. Because the details of HENNIG86 
have not been published, we have been cautious 
about choosing between these alternatives and have 
done all Group analyses with HENNIG86 using 
both notations. 
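
A small Python sketch may help show why binary coding roughly doubles lengths and why heterogeneities can leave the total short of double. The code for T is an assumed continuation of the "and so forth" above, the function names are hypothetical, and the sketch compares pairs of sites rather than computing tree lengths as HENNIG86 does.

```python
# Codes for A, G, and C follow the text; T is an assumed continuation.
BINARY = {"A": "1000", "G": "0100", "C": "0010", "T": "0001"}


def encode_site(bases):
    """Four-character binary code for one or more nucleotides at a site."""
    bits = ["0"] * 4
    for base in bases:
        bits[BINARY[base].index("1")] = "1"
    return "".join(bits)


def bit_steps(x, y):
    """Number of binary characters that differ between two coded sites."""
    return sum(a != b for a, b in zip(x, y))


print(encode_site("AG"))                                  # 1100
print(bit_steps(encode_site("A"), encode_site("G")))      # 2
print(bit_steps(encode_site("AG"), encode_site("A")))     # 1
```

A single substitution between two unambiguous nucleotides costs two binary steps, whereas a heterogeneous site lies only one step from each of its constituent states, which is consistent with the deficit described above.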

We structured our investigation such that the 
majority of families were represented by at least 
two species from different genera. If we accept 
that taxonomy is seldom wrong when placing gen- 
era within families (Heywood, 1978), then we have 
an empirical way of judging the merits of the two 
notations. Omitting families that have either a sin- 
gle representative or are multiply represented, all 
Groups have been analyzed using both notations 
and, from the minimal trees recorded, we have 
chosen the best as judged by pairing of represen- 
tatives of families. In eight Groups both methods 
had the same best tree, in six binary gave the best, 
and in eleven MINTREE notation gave the best. 
In the section "Analyses Within Groups" we have
therefore used the taxonomically best minimal tree 
no matter which notation was used to derive it. 
However, in later sections we have used binary 
notation exclusively because it is quicker and more 
convenient. 

Among best trees, 79% of families showed correct
pairing of their members. When judging this
result, it should be remembered that a single mis- 
placed species will often result in the failure of 
pairing of representatives of two families. While a 
few such occurrences may be the result of incorrect 
taxonomy, the remainder are presumably caused 
by convergent evolution. The details can be seen 
in the figures for the Groups. 
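
The pairing criterion can be sketched as follows. This is only an illustration: it assumes a rooted tree written as nested tuples, counts a family as correctly paired when its members form a complete clade, and uses placeholder species and family names. The tree representation and function names are hypothetical and do not come from the programs used in this study.

```python
def clades(tree):
    """Return (leaves of this subtree, leaf sets of every clade within it)."""
    if isinstance(tree, str):
        leaves = frozenset([tree])
        return leaves, [leaves]
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)
    return leaves, found


def pairing_score(tree, family_of):
    """Return (families whose members form a clade, families that can be tested)."""
    _, all_clades = clades(tree)
    clade_set = set(all_clades)
    members = {}
    for species, family in family_of.items():
        members.setdefault(family, set()).add(species)
    # Singletons cannot pair, so only families with two or more members count.
    testable = {f: frozenset(m) for f, m in members.items() if len(m) >= 2}
    paired = [f for f, m in testable.items() if m in clade_set]
    return len(paired), len(testable)


# Placeholder data: singleton c1 disrupts the pairing of family B.
EXAMPLE_TREE = (("a1", "a2"), ("b1", ("c1", "b2")))
EXAMPLE_FAMILIES = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C"}

if __name__ == "__main__":
    paired, testable = pairing_score(EXAMPLE_TREE, EXAMPLE_FAMILIES)
    print(f"{paired} of {testable} families paired")
```

In the placeholder tree a single misplaced species is enough to break the pairing of a family, which is the effect the preceding paragraph asks the reader to keep in mind.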


ANALYSES WITHIN GROUPS 
EXPLANATION OF THE FIGURES 


The figures are drawn to scale, which is indicated 
by the length of one inferred nucleotide difference 
(i.n.d.). Only lengths have meaning, not angles. 
Usually at least two trees have been derived for 
each Group, one using sequences of individual spe- 
cies and the second using derived familial nodes. 
If the first shows congruent grouping of putative 
members of families, then the second is not needed. 
Disruption of familial grouping is often caused by
the sole representative of another family or by a 
member of a multiply represented family. The sec- 
ond tree is derived from familial nodes and single- 
tons, is drawn with a different scale, and uses only 
the three-letter familial abbreviation of Weber 
(1982), which is given in Table 2. When appro- 
priate, trees of multiply represented families or 
genera are also given. In later analyses, Group 
nodes, and often one or two others, will be used; 
each of these is numbered. 


The Base of the Angiosperm Tree 


Before detailed analysis of Groups began, many 
analyses were done using the five gymnosperms, 
representatives of the monocotyledons and of 
Groups 1, 2, and 3 which were most likely, on 
taxonomic grounds, to be near the root of the 
dicotyledon tree. The angiosperm family closest to 
the gymnosperms was Schisandraceae. Figure 1 
shows the junction of the gymnosperms, Schisan- 
draceae, and the other angiosperms. The derived 
sequence of this node has been used as “Base” in 
all subsequent Group analyses. 

It will be noted that Figure 1 is different from,
and taxonomically more satisfactory than, the 
equivalent figure of Martin and Dowd (1989). Since 
then a sequence of Welwitschia has been obtained 
and this paired with Ephedra between the angio- 
sperms and the other gymnosperms. Three at- 
tempts to study Gnetum were made but all failed 
with symptoms suggesting strong leaf protease ac- 
tivity. 

FIGURE 1. Five gymnosperms analyzed with familial
nodes of five angiosperm families from Groups 1, 2, and 
3. The ancestral sequence derived for the junction of 
Schisandraceae has been used as an outgroup for ana- 
lyzing the Groups of dicotyledons. 


Group 1. An attempt to study Hedycarya 
(Monimiaceae) having failed, Peumus was left as 
a singleton, which was therefore omitted from Fig- 
ure 2a but included in Figure 2b. Correct pairing 
and grouping occurs in all families except Aristo- 
lochiaceae for which a derived familial node is 
shown in Figure 2b. As indicated above, Schisan- 
draceae is nearest to Base with a rather long gap 
to the remainder. 


Group 2. In contrast to the straightforward- 
ness of the previous Group, Group 2 has presented 
problems that arise from the great intrafamilial 
variation of Ranunculaceae and Papaveraceae, the 
trees for which are shown in Figure 3b and c. The 
two derived familial nodes are shown with the rest 
of the Group in Figure 3a. We interpret Ranun- 
culaceae and Papaveraceae to be ancient angio- 
sperm families, and only with some misgivings have 
we adhered to our acceptance of current taxonomy 
at the levels of family and below. This is especially 
so for Papaveraceae, but splitting off Fumariaceae 
creates more problems than it solves. 

FIGURE 2. (A) Group 1, omitting the single representative
of Monimiaceae. (B) Tree of family nodes for
Group 1.


Volume 78, Number 2 


1991 
FIGURE 3. (A) Group 2, with family nodes, derived
from (B) and (C) for Ranunculaceae and Papaveraceae
s.l.


FIGURE 4. Group 3.


Martin & Dowd 317 
Angiosperm Phylogeny Using Protein 
Sequences 

FIGURE 5. The monocotyledons that have been studied,
with some families represented by their nodes.


Group 3. Failure with a species of Cabomba 
has left Brasenia as a singleton that does not 
separate from the three members of Nymphae- 
aceae; however, the internode is so short that we 
draw no significance from it (Fig. 4). 


Piperaceae, Nelumbonaceae, and the mono- 
cotyledons. Piperaceae and Nelumbonaceae have 
not been placed in a Group because in both cases 
opinions differ among the four phylogenies consid- 
ered when we nominated Groups. All place them 
in either our Group 3 or Group 1, so it is appro- 
priate to carry out a joint analysis with the members 
of these two Groups, and at the same time to 
consider the links with the monocotyledons. Al- 
though no species have been added to those re- 
ported earlier (Martin & Dowd, 1989), the se- 
quences have been reanalyzed with HENNIG86. 
To reduce the number of taxa to be compatible 
with the ie option, familial nodes have been used 
for Araceae, Arecaceae, Commelinaceae, Poaceae, 
and Smilacaceae (Fig. 5). The monocotyledon node 
has been derived and is included in Figure 6. The 
result is different from that of Martin and Dowd 
(1989), and this is presumably because of the new 
computing program. 

Group 4. There is good pairing between members
of Ulmaceae and Urticaceae but not Moraceae
(Fig. 7). It was only after failures with two species
of Ficus and one of Maclura, almost certainly due
to protease activity, that we supplemented with
Humulus, knowing that its taxonomic position was
not entirely clear. The failure of Humulus and
Morus to pair was therefore not surprising. Subsequently,
Humulus was removed from Group 4
and added to the list of uncertain taxa (see below).


Group 5. The tree of Nothofagus (Fig. 8b) is
slightly different from that of Martin and Dowd
(1988) because it is influenced by the weighting
procedure of HENNIG86 and because it includes
Fagus, which does not separate. From this tree a
node has been derived and used in Figure 8a. While
Betulaceae, Casuarinaceae, and Hamamelidaceae
have correct grouping, the junction with Base divides
Nothofagus from Quercus.


Group 6. Figure 9 differs from one already
published by Martin and Dowd (1989); the two
trees are of the same length but this one is preferred
because it shows perfect pairing and reflects the
order when only familial nodes are used. However,
the other probably conforms better with taxonomy
in that Dilleniaceae is separate from the other three
families.


Group 7. Figure 10 shows that the two representatives
of Juglandaceae pair leaving Myrica,
the sole representative of Myricaceae, separate.

FIGURE 11. (A) Group 8. (B) Group 8 family nodes.


Group 8. The “Centrospermae” is one of the 
most unsatisfactory groups with representatives of 
two families, Chenopodiaceae and Amaranthaceae, 
failing to form pairs (Figure 11a). Spinacia and
Beta have identical sequences, but these are quite
different from Chenopodium. The tree for family
nodes is in Figure 11b.


Group 9. There is a marked difference be- 
tween the two representatives of Tiliaceae; Grewia 
is at the bottom of the tree (Fig. 12), while Spar- 
mannia disrupts the clustering in Malvaceae. How- 
ever, the remaining four families are satisfactory. 
Grewia was removed from Group 9 and added to 
the list of uncertain taxa. (As will be mentioned 
later, it subsequently rejoined.) 


Group 10. While Violaceae, Cucurbitaceae, 
Salicaceae, Brassicaceae, and Flacourtiaceae 
formed good clusters (Fig. 13a), the two represen- 
tatives of Datiscaceae (Datisca and Tetrameles) 
were very different. Attempts to study Capparis 
having failed, Cleome was left unpaired so we chose 
Reseda from the putatively related family Rese- 
daceae. Since these two did group, we did not seek 
correct partners for them. In addition to these two
singletons, Moringa represents a monogeneric
family. A tree from family nodes is shown in Figure
13b.

FIGURE 12. Group 9.


Group 11. The representatives of all four families
form pairs (Fig. 14), Myrsinaceae adjacently
and the other three families dichotomously.


Group 12. As mentioned earlier, the two species
of Epacridaceae are distinguished by having an
additional amino acid inserted in their sequences.
Although this could have been used as a character,
it was unnecessary because the two species paired
separately from the two Ericaceae species (Fig. 15).


Group 13. When family nodes are derived for
Rosaceae, Cunoniaceae, and Saxifragaceae, they
are very close (Fig. 16b), so it is not surprising
that there is confusion when individual species are
analyzed (Fig. 16a). The representatives of Rosaceae
pair correctly, however.


Group 14. Among minimal trees derived when
all legume species are analyzed simultaneously,
there are some in which the two Mimosaceae species
pair and so do the two Caesalpiniaceae; however,
the eight Papilionaceae species are confused.
We have therefore derived a Papilionaceae node
separately (Fig. 17b) and show this with the other
two families (Fig. 17a).

FIGURE 15. Group 12.

FIGURE 17. (A) Group 14 with Papilionaceae represented
by a node derived from (B).

FIGURE 18. (A) Group 15 omitting Trapa and Punica, which are included with family nodes in (B). (C) Onagraceae.


Group 15. This Group, which corresponds to
the order Myrtales, was discussed by Martin and
Dowd (1986a). Since then only Trapa and Punica,
both singletons, have been added. When the representatives
of the other five families are analyzed,
pairing is good except in Lythraceae (Fig. 18a).
The three members of Onagraceae in this tree are
from the bottom of the family tree (Fig. 18c). When
family nodes are analyzed (Fig. 18b), the root of
the tree is in a different place from the one previously
published; it is uncertain why this is so, but
it could be due to the inclusion of new families,
the earlier choice of inappropriate outgroups, or
the new analytical methods.


Group 16. As discussed earlier, all representatives
of the hemiparasites of the Santalales and
Loranthaceae failed to yield protein samples, so
this Group is reduced to Olacaceae and Celastraceae
in which pairings are straightforward (Fig. 19).

FIGURE 19. Group 16.


Group 17. This Group is not very satisfactory
possibly because, as indicated in Figure 20b, there
has been a rapid radiation. The consequences are
that the members of Simaroubaceae and Sapindaceae
do not pair, while Flindersia, sometimes
excluded from Rutaceae, does not group with the
other two representatives of that family. However,
there is good pairing for Connaraceae and Anacardiaceae
(Fig. 20a). Melia having failed, Cedrela
is left as the sole representative of Meliaceae.


Group 18. The two members of Haloragaceae,
Gonocarpus and Haloragodendron, are so
confounded with the three members of Rhizophoraceae
(Fig. 21) that there was no point in deriving
family nodes to derive a Group node.


Group 19. Nitraria having failed, Zygophyllum
is a singleton as is Tropaeolum, for which
no partner was available. As shown in Figure 22a,
the members of Geraniaceae and Malpighiaceae
pair. The family node tree is shown in Figure 22b.

FIGURE 20. (A) Group 17 omitting Melia, which is included with family nodes in (B).

FIGURE 21. Group 18.


Group 20. Hoya was left a singleton by failure 
to extract protein from two other members of As- 
clepiadaceae, Asclepias and Cryptostegia. The 
members of four families showed dichotomous pair- 
ing while Logania paired alongside Buddleia, which 
is only possibly a member of Loganiaceae (Fig. 23). 


Group 21. The representatives of Lamiaceae 
and Verbenaceae were very similar but there was 
nevertheless a minimal tree in which congruent 


pairing occurred (Fig. 24). 


Group 22. There have been previous reports
of Solanum (Martin et al., 1986) and Nicotiana
(Martin & Dowd, 1984b). Using HENNIG86, new
nodes have been derived for both genera (Fig. 25b,
c) and were used, with Anthocercis, to represent
Solanaceae in Figure 25a. These group well but
there is confusion between the representatives of
Convolvulaceae and Polemoniaceae.


Group 23. As mentioned earlier, all attempts
to extract rubisco from Acanthaceae species failed.
The representatives of the other four families of
this Group pair well, Scrophulariaceae, Gesneriaceae,
and Bignoniaceae dichotomously and Pedaliaceae
adjacently (Fig. 26).


Group 24. The three members of Caprifoliaceae
are substantially different from the two members
of Valerianaceae so that correct grouping is
observed (Fig. 27).


Group 25. This Group is unusual in that Apium
and Foeniculum of Apiaceae have identical
sequences as do Schefflera and Fatsia of Araliaceae.
Consequently, the tree of this Group (Fig.
28) is very simple.


THE DERIVATION OF A TREE FOR THE
GROUPS OF DICOTYLEDONS 


FIRST STAGE; A TEST OF THE 
REALITY OF GROUPS 


Depending on the size and complexity of the 
Group, one, two, or three nodes have been marked 
near the bases of each Group tree; altogether there 
are 58 basal nodes and the ancestral sequence of 
each has been derived using ANALYZE. These 
have been used for a test of the reality or integrity 
of the Groups. If a family does not really belong 
to a Group, it should usually behave like an out- 
group and assume the position closest to the base 
of the tree. Thus, in a simultaneous analysis of all 
58 basal nodes, it would be expected that nodes 
truly belonging to the same Group should cluster 
together. If a family is misplaced in a Group, the 
nodes should separate. 

The only program that can be used with 58 taxa 
simultaneously is HENNIG86 with the option 
mhennig followed by bb. This was done three times, 
each yielding large numbers of trees for which strict 
consensus trees were derived. Inspection indicated 
that most Groups behaved as if they were real, but 
some separation of nodes occurred in Groups 5, 
8, 14, 15, 22, and 24 (see below). It is unlikely 
that this sort of analysis would give a completely 
reliable result, but our interpretation is that where 
there is no separation of within-Group nodes, that 
Group should be accepted as valid. We understand 
that our test is not infallible, but we are reluctant, 
at this stage, to attempt another obvious test, viz. 
the simultaneous analysis of adjoining Groups. This 
test was used earlier with Groups 1, 2, and 3 and 
led to considerable mixing of the first two. The 
amount of convergent evolution between Groups 
is probably such that, if this test were applied 
widely, confusion would result. Therefore, even 
though we understand the limitations, we confine 
our testing of the integrity of Groups to one sort 
of analysis. 

For the six Groups where there was doubt, we 
applied the test devised by Lake (1987). This is 
confined to four species, A, B, C, D and uses a 
chi-square test to decide which is the most probable 
of the three possible relationships, viz. A + B &
C + D, or A + C & B + D, or A + D & B + C.
Thus representatives of each part of a divided 
Group were tested with representatives of the Groups 
with which they most closely clustered. These tests 
gave no further grounds for doubting the integrity 
of Groups 8 and 15 and consequently, in the next 
stage of the analysis, they were included unchanged.
The tests reinforced the doubts about
Groups 14, 22, and 24, so their separate parts 
were added to the list of uncertain families to be 
incorporated later. These were: from Group 14, 
Mimosaceae plus Papilionaceae on the one hand 
and Caesalpiniaceae on the other; from Group 22, 
Convolvulaceae plus Polemoniaceae on the one 
hand and Solanaceae on the other; from Group 
24, Valerianaceae and Caprifoliaceae. Tests with 
Group 5 were equivocal, so Hamamelidaceae was 
removed and added to the list of uncertain families, 
but the node for the remaining three families was 
used at the next stage. 
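
The three alternative relationships among four taxa can be enumerated with a short sketch. It only lists the candidate pairings; it does not implement the chi-square procedure of Lake (1987), and the function name is hypothetical.

```python
def quartet_pairings(a, b, c, d):
    """List the three ways four taxa can be split into two pairs."""
    return [
        ((a, b), (c, d)),
        ((a, c), (b, d)),
        ((a, d), (b, c)),
    ]


if __name__ == "__main__":
    for left, right in quartet_pairings("A", "B", "C", "D"):
        print(f"{left[0]} + {left[1]} & {right[0]} + {right[1]}")
```

A test of the kind described would then score each of these three pairings against the sequence data and ask which is best supported.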


SECOND STAGE; DERIVING A PRELIMINARY, 
ABBREVIATED TREE 


Following the first stage, the basal node was 
used to represent each of the 22 remaining Groups 
(though amended in Group 5 after removal of Ham- 
amelidaceae). Several analyses, using mhennig and 
bb, were carried out on these nodes. The object 
was to identify apparently constant associations 
from which nodes might be derived in order to 
reduce the number of taxa to 16, a number com- 
patible with analyses using the reliable ie program 
at the next stage. The following five pairings were 
chosen and their nodes derived: Groups 2 and 3; 
Groups 6 and 16; Groups 7 and 9; Groups 11 and 
19; Groups 21 and 23. Group 25 was omitted at 
this stage because it was small, well-defined by 
morphology and our own work and could be in- 
corporated later in the same way as the uncertain 
taxa. The resulting tree of 16 taxa is shown in 
Figure 29. 


THIRD STAGE; INCORPORATING TAXA OF UNCERTAIN 
AFFINITIES 


At this point there were 28 taxa on the list of 
those with uncertain affinities, comprising 18 fam- 
ilies that were not placed in a Group (see Table 1 
but note that Piperaceae and Nelumbonaceae were 
considered earlier), two genera (Humulus and 
Grewia) excluded from Groups during their anal- 
ysis, five families and two pairs of families excluded 
during the first stage, and Group 25 omitted at the 
second stage. We wanted to add these into the 
second stage tree as accurately as possible using 
the ie program. These analyses with 17 taxa could 
each be performed in about a day. Although there 
was some variation, the second stage tree remained 
reasonably stable during these analyses, and we 
noted where each uncertain taxon fit. Six joined 
in the basal third, 14 in the middle third, and 16
in the distal third. (The nonadditivity reflects that 
rigid demarcation was not exercised and borderline 
taxa were placed in two sets.) The members of each 
of the three sets were then analyzed with the cor- 
responding members of the second stage tree and 
possible new or amended Groups were identified. 


FOURTH STAGE; REDEFINITION OF GROUPS 


Putative new or amended Groups were tested 
extensively to ensure that they were real. In this 
process an important factor in determining the 
coherence of Groups was the length of the inter- 
node joining a hitherto uncertain member to the 
Group. Penny et al. (1987) have emphasized that 
“long edges attract,” and we have long been aware
that the junction of a distantly connected taxon is 
subject to so much variation that it is scarcely 
reliable. Thus, we have usually rejected a potential 
new member of a Group if it joins with a dispro- 
portionately long internode and have left it as un- 
had dissolved.


In three cases, definite hypotheses arising from 
our work could be tested. In each case the question 
was whether a taxon belonged to the Group to 
which it was initially assigned (Table 1) or to the 
Group indicated by stage 3. This could be answered 
by considering the lengths of the alternative trees. 
In two cases, Humulus and Hamamelidaceae, the 
new grouping was shorter and therefore preferred. 
For Grewia, the trees were the same length so 
there was no good reason for preferring the new 
grouping (with Group 18). 

As a consequence of these tests, only 15 of the 
original 25 Groups have the same composition as 
they had before the first stage of this section. The 
other ten Groups have been increased, decreased, 
or merged. Where a nucleus of an original Group 
remains, the number has been retained but A added.
Original Groups 13, 24, and 25 have disappeared.
New Groups 26, 27, 28, and 29 have been
formed.


Group 4A. Humulus has been removed.


The provisional tree of Group nodes abbreviated by combining some Groups and omitting others that 


Group 5A. Hamamelidaceae has been removed.


Group 8A. Although the original Group 8
(“Centrospermae”) remains intact, Lecythidaceae
and Humulus join the same branch of the tree (Fig.
30).


Group 12A. At all stages, Convolvulaceae and
Polemoniaceae grouped separately from Solana- 
ceae, the other member of Group 22. At the third 
stage, along with Polygonaceae, they clustered with 
Group 12 (Fig. 31). Polemoniaceae and Ericaceae 
are confused, but the other three families grouped 
appropriately. 


Group 14A. The second stage tests suggested 
that the legumes should be divided between Caesal- 
piniaceae on the one hand and Mimosaceae and 
Papilionaceae on the other; Caesalpiniaceae clus- 
tered with Group 13. A series of Lake tests (see 
“First stage” above and Martin & Dowd, 1990) 
was therefore performed. These tests strongly in- 
dicated, first, that Caesalpiniaceae was closer to 
Rosaceae than to either of the other two legume 
families and, second, that Mimosaceae and Papilio- 
naceae were closer to other Groups (e.g., Connar- 
aceae in Group 17, Chrysobalanaceae in Group 
18A below) than were Caesalpiniaceae and Rosa- 
ceae. 

Other second stage tests had indicated that Proteaceae,
Coriaria, Crossosoma, and Hamamelidaceae
were also linked to the complex of Groups

FIGURE 31. Group 12A.

FIGURE 32. (A) Group 14A. For three legume families
see Figure 17, and for Proteaceae see (B).

all analyzed together is shown in Figure 32a. The
Proteaceae node was derived from Figure 32b,
while the Mimosaceae-Papilionaceae node is node
3 of Figure 17a.


Group 18A. Third stage tests suggested that
the Chrysobalanaceae and Vitaceae might cluster
with Group 18 and also with Group 25 (Apiaceae
and Araliaceae). Incorporation of these (Fig. 33)
does nothing to repair the previous (Fig. 21) disjunction
of Haloragaceae while Rhizophoraceae s.l.
remain apart from Anisophyllea.


Group 22A. With the other two families joining
Group 12A (above), the Solanaceae were left
as the sole representative.


Group 26. This new Group (Fig. 34a) consists
of three families (Campanulaceae, Caprifoliaceae
and Goodeniaceae), each with well-paired representatives.
With them is Asteraceae, the node for
which is derived from Figure 34b.


Group 27. This comprises the families Elaeagnaceae
and Rhamnaceae, the members of which
form pairs (Fig. 35).


Group 28. As noted below, Buxus does not 
pair with Simmondsia, which is sometimes placed 
in Buxaceae. While the latter clusters with Eu- 
phorbiaceae (Fig. 36), Buxus does not. 


Group 29. The species of Hydrophyllaceae,
Thymelaeaceae, and Valerianaceae form pairs in
this new Group (Fig. 37).

There are three families for which we have no
acceptable hypothesis. (a) Loasaceae. It was unfortunate
that we failed to obtain a sequence for
Eucnide bartonioides because this left Mentzelia
as a singleton and therefore with a “long edge”
that joined unreliably. (b) Plumbaginaceae. Although
the two representatives, Limonium and
Plumbago, paired well, there remained a very long
internode joining the family to the tree, and so we
have left it unplaced. (c) Buxaceae. Originally both
Buxus and Simmondsia were chosen as representatives
of Buxaceae (s.l.), but they proved quite
different and, since there was taxonomic opinion

FIGURE 37. Group 29.

FIGURE 38. The overall tree for the dicotyledons. Groups are numbered and their constituent families indicated
using the three-letter acronyms of Weber (1982), given in Table 2. Families in which nitrogen-fixation is known are
underlined.

[Plot labels: mean of Group means; X axis, Groups in order: 1, 2, 3, 22A, 8A, 15, 5A, 11, 17, 19, 12A, 20, 23, 4, 10, 6, 16, 28, 14A, 9, 27, 7, 18A, 29, 26, 21.]

Groups are arranged along the X axis in the order that they depart from the trunk of the overall
tree (Fig. 38). Solid dots are the mean distances of species of that Group from the angiosperm origin, and bars
indicate the range from smallest to greatest.


Simmondsia grouped reasonably well with Eu- 
phorbiaceae, Buxus did not and remains unplaced. 


FIFTH STAGE; THE SIMULTANEOUS 
ANALYSIS OF REVISED GROUPS 


Initially, the nodes of all 26 revised Groups were 
analyzed using the option mhennig followed by bb, 
and the resulting tree was divided into a top, middle, 
and bottom section. Thus, with overlaps, each con- 
tained 14 taxa, a number that could be analyzed 
using the ie option. Fortunately, there was no con- 
fusion at the overlaps, and the three parts were 
fitted together to give the overall tree (Fig. 38). 
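
The division into overlapping sections can be sketched as below. The sketch assumes the 26 Group nodes are held in a simple list in tree order and that the three windows are evenly spaced; in the study the sections were taken from the preliminary tree, so the list ordering and the function name here are hypothetical.

```python
def overlapping_sections(taxa, n_sections, size):
    """Cut an ordered list into evenly spaced windows of equal size."""
    if n_sections < 2 or size > len(taxa):
        raise ValueError("need at least two sections no longer than the list")
    step = (len(taxa) - size) / (n_sections - 1)
    starts = [round(i * step) for i in range(n_sections)]
    return [taxa[s:s + size] for s in starts]


if __name__ == "__main__":
    nodes = [f"node{i + 1}" for i in range(26)]  # placeholder labels
    bottom, middle, top = overlapping_sections(nodes, 3, 14)
    shared = [n for n in bottom if n in middle]
    print(len(bottom), len(middle), len(top), len(shared))
```

With 26 nodes and sections of 14, adjacent sections share eight nodes under this even spacing, and it is agreement among such shared nodes that allows the parts to be fitted together.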


DISCUSSION 


THE RATE OF EVOLUTION AND THE 
AGE OF THE ANGIOSPERMS 


In Figure 38, which shows the overall tree for 
the dicotyledons, there is a “trunk” from which 
branches depart at irregular intervals of up to 5 
i.n.d. In Figure 39, we arranged Groups in the 
order that they branch from the trunk. For every 
species we measured the number of differences (in 
i.n.d.) between it and the base of the angiosperm 
tree (Fig. 1). For each Group we show the mean 
of these distances and also the range from smallest 
to greatest. The mean of all Groups is 16.2 i.n.d. 
We have also analyzed variance and shown that 
there is significant (P « 0.001) variation between 
Groups. Thus, although the difference between a 
slowly evolving Group such as Group 3 (mean 14.1) 
and a rapidly evolving Group such as Group 21 
(19.7) is not great, it is probably real. 

The age of the dicotyledons can be derived from 
the product of the mean number of differences of 
species from base and the rate of evolution. Since 


Figure 6 suggests that the monocotyledons are 
derived from the dicotyledons, this is also the age
of the angiosperms. Martin and Dowd (1988) estimated
the rate to be 1 i.n.d. in 14 Ma for a single
evolutionary line. However, this estimate was based 
on members of the Fagaceae, Proteaceae, Sola- 
naceae, and Winteraceae, all of which belong to 
Groups that evolve more slowly than average; their 
mean number of differences from base is 14.7 i.n.d. 
Thus, the inferred age of the angiosperms is 14 ×
14.7 ≈ 205 Ma, that is, at the beginning of the
Jurassic. Crane et al. (1989) and Wolfe et al. 
(1989) have estimated the age of the angiosperms 
as 200 Ma. If the monocotyledons are indeed de- 
rived from the dicotyledons, there is good agree- 
ment. 
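
The age estimate above is a single product, shown here as a minimal sketch using only the figures quoted in the text; the constant and function names are hypothetical.

```python
# Figures quoted in the text.
MA_PER_IND = 14.0            # Ma per inferred nucleotide difference (Martin & Dowd, 1988)
MEAN_DIFF_FROM_BASE = 14.7   # mean i.n.d. from base for the calibrating families


def inferred_age_ma(ma_per_ind: float, mean_differences: float) -> float:
    """Age in Ma as rate (Ma per i.n.d.) times mean differences from base."""
    return ma_per_ind * mean_differences


if __name__ == "__main__":
    age = inferred_age_ma(MA_PER_IND, MEAN_DIFF_FROM_BASE)
    print(f"inferred age: about {age:.1f} Ma")
```

The mean of 14.7 i.n.d. is used rather than the overall mean of 16.2 because the rate was calibrated on families that belong to the more slowly evolving Groups.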


THE RELIABILITY OF OUR TREES 


The current limitations of computers and com- 
puting programs make it impossible to conduct a 
large phylogenetic analysis in a completely objec- 
tive manner. Our first important deviation from 
objectivity has been accepting taxonomic opinion 
that species belong to the same family. Our second 
has been seeking a consensus in placing these into 
Groups. 

The assumption of correct assignment to families 
is strongly supported by the correct pairing (or 
formation of clusters of three when appropriate) 
shown in the objectively derived figures of the final 
26 Groups. Of the 95 families with two or three 
representatives, only 11 had disjunct representa- 
tives and, of these, at least four were families sensu 
lato with taxonomic opinions that they should really 
be split. These are the separation of Humulus from 
Morus in Group 4, of Flindersia from other Ru- 
taceae (Group 17), of Buxus from Simmondsia 
(Group 28), and of Anisophyllea from other Rhizophoraceae
(Group 18). When it is further considered
that one aberrant species can disrupt two
families, we submit that the high proportion of 
correct grouping is strong evidence not only for 
the correctness of other taxonomy at this level but 
also of the soundness of our approach. 

If our methods, while not perfect, are good at 
the level of placing taxa into families, is there any 
reason why they should not be equally acceptable 
at higher levels? We have investigated this with 
the assumption that the probability of errors will 
increase as internode lengths decrease. From each 
of the Group trees we have determined that the 
average length of internodes within families (re- 
stricting the measurements to families with only 
two correctly paired representatives) is 5.6 i.n.d., 
while the average length of internodes between 
families is 4.7. From the final tree showing the 
relationships of Groups (Fig. 38), the average length 
of internodes is 3.0. Thus, if our assumption is 
valid, the ratio 5.6: 4.7: 3.0 should reflect the re- 
liability of arranging species within families, fam- 
ilies within Groups, and Groups in the final tree. 
We suggest caution about accepting relationships 
as the taxonomic level increases. 
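
The ratio can be put on a common scale with a short sketch. It follows the stated assumption that reliability tracks mean internode length and simply expresses each level relative to the within-family value; the names are hypothetical.

```python
# Mean internode lengths (i.n.d.) quoted in the text.
INTERNODE_MEANS = {
    "species within families": 5.6,
    "families within Groups": 4.7,
    "Groups in the final tree": 3.0,
}


def relative_to_first(values: dict) -> dict:
    """Express each value as a fraction of the first entry."""
    first = next(iter(values.values()))
    return {level: length / first for level, length in values.items()}


if __name__ == "__main__":
    for level, ratio in relative_to_first(INTERNODE_MEANS).items():
        print(f"{level}: {ratio:.2f}")
```

On this scale the arrangement of Groups in the final tree rests on internodes roughly half as long as those within families, which is the basis for the caution urged above.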

There is no obvious reason why the ratio just 
reported should not be similar for other macro- 
molecular sequences. However, with nucleic acid 
sequencing (see review by Palmer, 1988) the 
amount of information available might increase by 
an order of magnitude over that presented here; 
thus, even if internode lengths at the highest levels 
are still proportionately small, the probability of 
errors due to chance when using small numbers 
should diminish and lead to more decisive phylog- 
enies. 


THE VALUE OF THIS STUDY 


We believe that the demarcation of plant taxa 
at all levels should be the prerogative of botanists 
with a broad background in taxonomy and that the 
same specialists are best suited to compare the 
results of this study, expressed as phylogenetic 
trees, with published phylogenies. Because we do 
not have that background, we resist the temptation 
to point out the similarities and differences that we 
perceive and to assess when our trees are likely to 
be incorrect. Our perceptions are likely to be un- 
balanced. 


One difference between this phylogenetic study 
and most others is that it is repeatable. Without 
detracting from the value of published angiosperm 
phylogenies, they do seem to depend on the ac- 
cumulated wisdom and experience of rare individ- 
uals whose relevant brain functions are not easily 
transmitted in entirety. On the other hand, anyone 
who follows our procedures should arrive at the 
same phylogenetic trees. More to the point, with 
improved analytical procedures it is possible that 
more acceptable endpoints may be reached. 

We have avoided the word “conclusions” be- 
cause we do not claim that this work is definitive. 
Rather it has led to new working hypotheses which, 
we hope, others will test with more extensive sam- 
pling and more data including much longer se- 
quences. To such investigators our analytical meth- 
od, whether perceived as successful or not, may 
be a useful example. 


NATURAL SELECTION AND THE EVOLUTION OF RUBISCO-SSU


Under “General remarks about the sequences,” 
we discussed heterogeneity within species and quot- 
ed the evidence of Pichersky et al. (1986) that 
natural selection acts to keep the amino acid se- 
quence constant. Below we present other evidence 
for the importance of natural selection. 

Under “Methods of Data Analysis,” we dis- 
cussed Lake’s test, which is based only on trans- 
versions (mutations from a purine to a pyrimidine 
or vice versa) and ignores transitions (purine to 
purine or pyrimidine to pyrimidine). Lake (1987) 
quoted evidence (Brown et al., 1982) that in animal 
mitochondrial DNAs, transitions occur an order of 
magnitude more frequently than transversions. 
Zimmer et al. (1989) have found for higher plant 
cytoplasmic rRNA that, on average, transitions 
were twice as frequent as transversions with the 
lowest ratio in the most invariant regions. We have 
investigated this in 44 families of Groups 1 to 10 
and have scored those amino acid changes within 
families that can be ascribed unequivocally to
transversions and transitions. There were 123
transitions and 306 transversions, so transitions
made up a proportion of 0.287 of these changes
(123 of 429), quite different from the
evidence just quoted.

We have considered each of the 61 codons in 
the genetic code and, assuming that each nucleotide
can change to any other with equal probability,
calculated the frequencies of the four possible
outcomes: transition causing an amino acid change,
transversion causing an amino acid change, no amino
acid change, and lethality (stop). Expressed as the
proportion of amino-acid-changing substitutions that
are transitions, the value for the two codons that
determine phenylalanine is 0.25, for eight of the
amino acids it is 0.33, and overall it varies from
0.14 to 0.34 with an average of 0.2845. Averaging
over the 31 variable amino acids in the top line of
Table 3 gives 0.268, which may be compared
with the observed figure of 0.287. This suggests
that, at the nonsilent positions, which are the only 
ones we are able to consider, there is close to 
randomness with respect to the occurrence of tran- 
sitions and transversions. 
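
The codon-by-codon calculation described above can be sketched as follows. This is an illustrative reconstruction, not the program used for this paper: it assumes the standard genetic code, gives equal weight to every single-nucleotide substitution, and pools the codons of each amino acid (the pooling rule and all function names are assumptions). Under these assumptions phenylalanine gives 0.25, as quoted in the text.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ...
BASES = "UCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}
PURINES = set("AG")


def is_transition(a, b):
    """Purine<->purine or pyrimidine<->pyrimidine change."""
    return (a in PURINES) == (b in PURINES)


def classify_codon(codon):
    """Count the four outcomes over all 9 single-base substitutions."""
    counts = Counter()
    original = CODON_TABLE[codon]
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            new = CODON_TABLE[mutant]
            if new == "*":
                counts["stop"] += 1
            elif new == original:
                counts["silent"] += 1
            elif is_transition(codon[pos], base):
                counts["transition"] += 1
            else:
                counts["transversion"] += 1
    return counts


def transition_fraction(amino_acid):
    """Transitions as a fraction of amino-acid-changing substitutions,
    pooled over all codons of one amino acid (pooling is an assumption)."""
    total = Counter()
    for codon, aa in CODON_TABLE.items():
        if aa == amino_acid:
            total += classify_codon(codon)
    changing = total["transition"] + total["transversion"]
    return total["transition"] / changing


# Observed proportion from the counts reported in the text.
observed = 123 / (123 + 306)

print("Phe:", transition_fraction("F"))
print("observed:", round(observed, 3))
```

Weighting the per-amino-acid values by the composition of the sequences, as in the Table 3 average, would require the residue frequencies themselves, which are not reproduced in this sketch.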

We suggest that the large discrepancy between 
our result and the expectations from chemistry and 
nucleic acid sequencing is due partly to our inability 
to score silent substitutions and partly to the over- 
whelming importance of natural selection in de- 
termining the amino acid sequence of an important 
enzyme. Even though most nucleotide substitutions 
are presumably transitions, this has little effect on 
the final outcome, the amino acid sequence, on 
which natural selection can act. 

Other evidence of strong natural selection comes 
from consideration of variation at positions like 8 
and 9. At position 8, 84% of species have glycine 
and 15% asparagine. This substitution requires at 
least two nucleotide changes so, in the absence of 
selection, the single-change intermediates serine or 
lysine would often be expected, though they have 
not been observed. Similarly, at position 9, 56% 
of species have leucine and 28% lysine. Again, this 
is a two-nucleotide change, but the only single- 
change intermediates found are methionine and 
isoleucine, and these are much too rare to occur 
randomly. Apparently, glycine and asparagine are 
“adaptive peaks” at position 8 and leucine and 
lysine are at position 9. When positions 8 and 9 
are considered together, there is a small excess 
over chance expectations of the combinations gly- 
cine-leucine and asparagine-lysine; these may be 
adaptive peaks because both combinations are found 
within Tiliaceae (Group 9), Papilionaceae (Group 
14A), Apocynaceae (Group 20), Proteaceae (Group 
14A), and different families of Group 15. Clearly, 
convergent evolution has occurred. 
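
The argument about two-nucleotide substitutions can be checked mechanically. The sketch below is illustrative only and assumes the standard genetic code; the function names are hypothetical. It reports the minimum number of nucleotide changes separating two amino acids and lists the amino acids (or stop, shown as `*`) encoded by codons lying one change from each. For leucine to lysine the minimum is two changes, and methionine and isoleucine both appear among the intermediates.

```python
from itertools import product

# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ...
BASES = "UCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons_for(amino_acid):
    """All codons that encode the given one-letter amino acid."""
    return [c for c, aa in CODON_TABLE.items() if aa == amino_acid]


def neighbours(codon):
    """All codons exactly one nucleotide substitution away."""
    return [codon[:i] + b + codon[i + 1:]
            for i in range(3) for b in BASES if b != codon[i]]


def min_changes(aa_from, aa_to):
    """Smallest number of differing positions over all codon pairs."""
    return min(sum(x != y for x, y in zip(a, b))
               for a in codons_for(aa_from) for b in codons_for(aa_to))


def single_change_intermediates(aa_from, aa_to):
    """Residues encoded by codons one change from an aa_from codon
    and one change from an aa_to codon."""
    targets = set(codons_for(aa_to))
    found = set()
    for start in codons_for(aa_from):
        for mid in neighbours(start):
            if any(n in targets for n in neighbours(mid)):
                found.add(CODON_TABLE[mid])
    return found - {aa_from, aa_to}


print(min_changes("L", "K"))
print(sorted(single_change_intermediates("L", "K")))
```

The same two functions can be applied to any other pair of residues, such as glycine and asparagine at position 8.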

This last evidence suggests that adjacent positions
influence one another, a known phenomenon. Another
example is probably found in the Onagraceae,
the only family with N-terminal phenylalanine and,
alongside it, asparagine, again found only in
Onagraceae. Solanum species with the same rare
substitutions at positions 15 and 21 suggest
that the effect can extend further. Another example
concerns positions 30 and 39, both of which are 
almost always either valine (V) or isoleucine (I). 
The frequencies within species of the four possible 
combinations (VV, VI, IV, II) indicate that the two 
positions evolve independently; nevertheless, they 
are different in the monocotyledons, with 67.5%
isoleucine, and the dicotyledons, with 35.4%
isoleucine. It is conceivable that monocotyledons are
richer in isoleucine because they have a more ef- 
ficient synthetic pathway for isoleucine so that, in 
the absence of other strong selective forces, the 
substitution of isoleucine for valine may be favored. 

Despite the evidence that natural selection is 
acting strongly, there are few decisive changes, 
such as the change from proline to isoleucine at 
position 6 during the evolution of the monocoty- 
ledons. At positions 7 and 8, the combination ty- 
rosine-asparagine occurs in the gymnosperms, 
Groups 1, 2, and 3, but in no other Groups, sug- 
gesting that these amino acids are primitive. How- 
ever, the distinction between primitive and ad- 
vanced is usually equivocal; for the following 
example, normal taxonomic criteria have been used 
to distinguish 58 primitive genera (those in the 
gymnosperms, Piperaceae, Nelumbonaceae, and 
Groups 1, 2, 3, and 5) from 67 advanced genera 
(those in Asteraceae, Campanulaceae, Goodeni- 
aceae, Hydrophyllaceae, and Groups 10, 20, 21, 
22, 23, and 24). At position 12, tyrosine occurred 
in 5% of primitive and 31% of advanced genera 
while at position 20 aspartic acid occurred in 5% 
of primitive and 43% of advanced genera. Although
the sampling is not entirely satisfactory, it
appears that tyrosine at position 12 and aspartic
acid at position 20 are advanced. However,
the important point is that the divergence is
so indecisive, the primitive amino acids phenylal- 
anine at position 12 and proline at position 20 still 
occurring in the majority of genera in all advanced 
Groups. 

If it is correct that natural selection acts strongly 
to determine the amino acid sequence of a protein, 
this could be important in considering “molecular 
evolutionary clocks.” If the clock that is considered 
is derived from nucleic acid sequences, the rare 
event that is the basis of regression of number of 
differences on time is nucleotide substitution, the
most common form of mutation and not always 
subject to natural selection. If, however, the clock 
is derived from amino acid sequences, the rare 